Section: New Results

Self-Attention Temporal Convolutional Network for Long-Term Daily Living Activity Detection

Participants : Rui Dai, François Brémond.

This year, we proposed a Self-Attention Temporal Convolutional Network (SA-TCN), which is able to capture both complex activity patterns and their dependencies within long-term untrimmed videos [34]. The attention block can also be embedded into other TCN-based models. We evaluated the proposed model on the DAily Home LIfe Activity (DAHLIA) and Breakfast datasets, where it achieves state-of-the-art performance.

Workflow

Given an untrimmed video, we represent each non-overlapping snippet by a visual encoding computed over 64 frames. This visual encoding is the input to the encoder-TCN, which combines the following operations: 1D temporal convolution, batch normalization, ReLU, and max pooling. Next, the output of the encoder-TCN is sent to the self-attention block to capture long-range dependencies. The decoder-TCN then applies 1D convolution and upsampling to recover a feature map of the same dimension as the visual encoding. Finally, the output is fed to a fully connected layer with softmax activation to obtain the prediction. Figures 18 and 19 show the structure of our model; an illustrative code sketch follows the figures.

Figure 18. Overview. The model consists of three main parts: (1) visual encoding, (2) encoder-decoder structure, (3) attention block.
IMG/over_view.png
Figure 19. Attention block. This figure presents the structure of the attention block.
IMG/attention_block1711.png
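
To make the pipeline of Figures 18 and 19 concrete, below is a minimal PyTorch sketch of the encoder-TCN, self-attention block, and decoder-TCN. It is an illustrative reconstruction, not the published implementation: the class names (SelfAttentionBlock, SATCN), the hidden width, kernel sizes, pooling factor, and the query/key projection ratio are all assumptions made for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionBlock(nn.Module):
    # Dot-product self-attention over the temporal axis (cf. Figure 19).
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions project the features to query / key / value spaces;
        # the reduction factor 8 is an assumption, not the published setting.
        self.query = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv1d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv1d(channels, channels, kernel_size=1)
        # Learnable residual gate, initialised so the block starts as identity.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                          # x: (batch, channels, time)
        q = self.query(x).transpose(1, 2)          # (B, T, C')
        k = self.key(x)                            # (B, C', T)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (B, T, T) pairwise weights
        v = self.value(x)                          # (B, C, T)
        out = torch.bmm(v, attn.transpose(1, 2))   # weighted sum over time
        return self.gamma * out + x                # residual connection

class SATCN(nn.Module):
    # Encoder-TCN -> self-attention -> decoder-TCN -> snippet-wise classifier.
    def __init__(self, in_dim, num_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),  # 1D temporal conv
            nn.BatchNorm1d(hidden),                               # batch normalization
            nn.ReLU(),
            nn.MaxPool1d(2),                       # halves the temporal resolution
        )
        self.attention = SelfAttentionBlock(hidden)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),           # recover the snippet resolution
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # 1x1 convolution acts as a per-snippet fully connected layer; the
        # softmax is applied by the loss (e.g. nn.CrossEntropyLoss) at training.
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):             # x: (batch, in_dim, num_snippets), even length
        z = self.encoder(x)
        z = self.attention(z)
        z = self.decoder(z)
        return self.classifier(z)     # per-snippet class logits

# Example: 128 snippets, each represented by a 512-d visual encoding.
model = SATCN(in_dim=512, num_classes=8)
logits = model(torch.randn(1, 512, 128))           # -> (1, 8, 128)

Because the block is residual and shape-preserving, it can be inserted between the layers of other TCN-based models (as done with TCFPN in Table 5 below) without changing their input/output dimensions.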

Results

We evaluated the proposed method on two daily-living activity datasets (DAHLIA and Breakfast) and achieved state-of-the-art performance. We compared against the following state-of-the-art methods: DOHT, Negin et al., GRU, ED-TCN, and TCFPN.
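
The tables below report frame-level accuracy (FA1), F-score, IoU and mAP. As a reference for the two simplest metrics, the following sketch shows how FA1 and per-class IoU are commonly computed on frame-wise label sequences; it reflects the standard definitions, and the exact protocols of the cited papers may differ.

import numpy as np

def frame_accuracy(pred, gt):
    # FA1: fraction of snippets whose predicted label matches the ground truth.
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())

def mean_iou(pred, gt, num_classes):
    # Frame-level IoU averaged over the classes present in the ground truth.
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if (gt == c).any():           # skip classes absent from this video
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example on a 6-snippet sequence:
# frame_accuracy([0, 1, 1, 2, 2, 2], [0, 1, 2, 2, 2, 2]) -> 0.833...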

Table 2. Activity detection results on the DAHLIA dataset, averaged over views 1, 2 and 3. Methods marked with * were not tested on DAHLIA in their original papers.
Model FA1 F-score IoU mAP
DOHT 0.803 0.777 0.650 -
GRU* 0.759 0.484 0.428 0.654
ED-TCN* 0.851 0.695 0.625 0.826
Negin et al. 0.847 0.797 0.723 -
TCFPN* 0.910 0.799 0.738 0.879
SA-TCN 0.921 0.788 0.740 0.862
Table 3. Activity detection results on Breakfast dataset.
Model FA1 F-score IoU mAP
GRU 0.368 0.295 0.198 0.380
ED-TCN 0.461 0.462 0.348 0.478
TCFPN 0.519 0.453 0.362 0.466
SA-TCN 0.497 0.494 0.385 0.480
Table 4. Average precision of ED-TCN on DAHLIA.
Activity AP
Background 0.36
House work 0.65
Working 0.95
Cooking 0.96
Laying table 0.90
Eating 0.97
Clearing table 0.80
Wash dishes 0.97
Table 5. Combination of the attention block with another TCN-based model (TCFPN), evaluated on the DAHLIA dataset.
Model FA1 F-score IoU mAP
TCFPN 0.910 0.799 0.738 0.879
SA-TCFPN 0.917 0.799 0.748 0.894
Figure 20. Detection visualization for video 'S01A2K1' of DAHLIA: (1) ground truth, (2) GRU, (3) ED-TCN, (4) TCFPN and (5) SA-TCN.
IMG/visulization_n.png